As part of the welcomed journey that has been the Nanodegree in Datascience for 2020, we are taken upon the quest of delivering a project that attests the buildup of our skills. This project will focus on Credit Risk and interpretable machine learning applications for credit professionals that are involved in the field of Data Science.
The source of our data is the historical Kaggle competition "Give me some Credit". The challenge regards in classifiying if a client will default in the payment of his financial obligations in a timeframe of 2 years.A welcome and sound challenge for joining professionals in the field.
Dataset Available on:https://www.kaggle.com/c/GiveMeSomeCredit/overview
Our main goal will be to implement MLI techniques (Mainly LIME) to results of a full Machine Learning process. MLI techniques are paramount to understand the record-by-feature granularity required from regulatory bodies across the globe. Furthermore, MLI techniques help to detect model bias on ethically dubious features.
Being able to fully understand and explain at the right level of granularity the results of a credit risk model, will help the community to grow tolerant in accepting the results of Ml models.
Metrics for this project should be:
Exploration of a dataset is paramount to a correct framing of our problem. Out objective is to predict an event of serious delinquency in the next 2 years based on 10 variables. In order to really understand our data and the vastness of the problem at hand we will focus on the next steps:
The Give Me Some Credit dataset has a recollection of important features regarding the potential credit risk of the Client. We have a total of 150.000 different clients available to user in our training excercise. The feature space for each client ranges from financial standing features, financial status to demographic.
Descriptive statistics of the dataset are as follows
data.describe()
The Give Me Some credit competition support team kindly offered a dictionary for this dataset that is summarized for reader support
Regarding theoretical limits for variables, Utilization cannot be negative (unless overdraft). High Debt to Income Ratios should not be processed as outliers, when income becomes 0 debt to income limit approaches infinity. Negative incomes are data errors, possible outliers regarding demographic data
We will start visualization plots always taking into account our target feature: Loan Default in the next 2 years, as it is a part of a bank livelihood, credit default rates should be consistently imbalanced.
We have a total of 6.6% of the 150k population that has incurred in serious default events in two years. Given the nature of the feature and the dataframe I suspect this information is collected from a Mortgage portfolio.
The distribution of the target feature is presented as follows:
pd.DataFrame(data['SeriousDlqin2yrs'].value_counts(normalize=True))
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
labels = ['Default', 'Non-Default']
values = [(data[data['SeriousDlqin2yrs']==1].sum()['SeriousDlqin2yrs'])/150000, (150000-data[data['SeriousDlqin2yrs']==1].sum()['SeriousDlqin2yrs'])/150000]
ax.bar(labels,values, color= ['red', 'green'], alpha=0.7)
plt.title('Figure 1. Distribution of Target Feature: SeriousDlqin2yrs')
plt.show()
As part of Portfolio theory, the goal of a risk manager is to select the optimal return for the associated risk on the investment. The bank provides loan applicants with the funds required for an especific purpose and in return it gets the interest on the money lent.[0]
A correct measurement of risk allows the bank to maximize its returns with an expected level of delinquency on its loans the goal of our model is to predict as accurate as possible this suggested level of delinquency in the next two years.
In addition to the target feature analysis, we provide expected signs for our feature space as well as visualizations for each feature in Annex(A).CreditDefaultEDA
The results from our EDA analysis are summarized on this table. All of this results come from a main question: What worries me about this feature?
| Feature Name | Expected Sign | Outlier Problems | Data Problems |
|---|---|---|---|
| SeriousDlqin2yrs | NA | NA | Unbalanced |
| RevolvingUtilizationOfUnsecuredLines | - | NA | Values > 1 suggest problems. Informative |
| Age | + | NA | Remove underage population |
| NumberOfTime30-59DaysPastDueNotWorse | - | NA | Nan Treatment, encode Values |
| DebtRatio | - | NA | NaN not associated to default, non reliable |
| MonthlyIncome | + | NA | NaN treatment, standarize |
| NumberOfOpenCreditLinesAndLoans | - | Cap at 10 | Standarize |
| NumberOfTimes90DaysLate | - | Encode, Relevant feature, possible leak | |
| NumberRealEstateLoansOrLines | + | ||
| NumberOfTime60-89DaysPastDueNotWorse | - | Encode | |
| NumberOfDependents | - |
The logic behind expected signs is as follows. Users in financial distress tend to start using their available resources to cover the leak in their financial distress. A high utilization at high rates suggest a troublesome set.For example, A low level of liquidity, low financial knowledge sophistication ( Highest borrowing rates in the market) or non expected financial windfall, among others.
As Warren buffet once said: " You cannot go through life borrowing at those rates" [1]
The logic behind age comes from dynamic Macroeconomic theory and the financial lifecycle. We expect (segregated by income) a Solow's type of income stock (Savings) were the bulk of the individual's life expenditure diminishes as life progresses,and income grows until it becomes static at retirement. [2]
Debt to income ratios suggest how much the incurred loans from the client are a burden to its day to day life, accurate income measuring its a problem in itself but when a client report 0 income to the bank, it is in the bank's best interest to offer reestructuring facilities to weather the crisis to a client. [3]
In practice however, a low debt to income ratio paradoxically has been reported to be associated to a decrease in payment rates as well. Small amounts of debt seem to easily escape the to-do list of high income segments of population.
Regarding Past-due Events our logic relies on Merton's modelling for financial assets. The closest you have been to default should weight on how likely you are to default in the future. Or, the standard deviation from expected returns should be priced proportionally to the expected return. A simple but elegant solution to asset pricing. [4]
This type of analysis are not possible on high dimensional feature spaces but from time to time, this low dimentional feature spaces give way for some leniency on the effect of theory on the expected results from our model
Identify clear Nan Values, replace with median. Median preferred due to robustness on data, and feasible value. The current state of missing values on the training set is displayed in the table below
pd.DataFrame(data.isna().sum(), columns = ['Missing Values']) # Diagnose Dataframe
# Create Boolean Features
data['MonthlyIncome_na'] = data['MonthlyIncome'].isna()
data['NumberOfDependents_na'] = data['MonthlyIncome'].isna()
# Input median on problem features
data['MonthlyIncome'] = data['MonthlyIncome'].fillna(data['MonthlyIncome'].median())
data['NumberOfDependents'] = data['NumberOfDependents'].fillna(data['NumberOfDependents'].median())
# Check data
pd.DataFrame(data.isna().sum(), columns=['Missing Values'])
As part of outlier treatment of variables we will clip outliers according to the interquartile range of the population
# Encode Outliers
data['RevolvingUtilizationOfUnsecuredLines_out'] = np.where(data['RevolvingUtilizationOfUnsecuredLines']>=2.14, 1, 0)
data['age_out'] = np.where(data['age'] >= 90, 1, 0)
data['NumberOfTime30-59DaysPastDueNotWorse_out'] = np.where(data['NumberOfTime30-59DaysPastDueNotWorse'] > 13, 1, 0)
data['DebtRatio_out'] = np.where(data['DebtRatio'] > 2.94, 1, 0)
data['MonthlyIncome_out'] = np.where(data['MonthlyIncome'] > 23000, 1, 0)
data['NumberOfOpenCreditLinesAndLoans_out'] = np.where(data['NumberOfOpenCreditLinesAndLoans'] > 20, 1, 0)
data['NumberOfTimes90DaysLate_out'] = np.where(data['NumberOfTimes90DaysLate'] > 10, 1, 0)
data['NumberOfTime60-89DaysPastDueNotWorse_out'] = np.where(data['NumberOfTime60-89DaysPastDueNotWorse'] > 3, 1, 0)
After preprocessing the integrity of the information holds. No records were omitted during the cleaning process, just adjusting outliers to max values inside each feature distribution
The first approach towards model creation will be with a random Forest Classifier with the standard parameters assigned by scikit-learn.A Train/Test split at 70% of the population is defined to avoid overfitting. The model should optimize for classification metrics such as AUC, or the components of the F1-Score.
X = data[['RevolvingUtilizationOfUnsecuredLines',
'age',
'NumberOfTime30-59DaysPastDueNotWorse',
'DebtRatio',
'MonthlyIncome',
'NumberOfOpenCreditLinesAndLoans',
'NumberOfTimes90DaysLate',
'NumberOfTime60-89DaysPastDueNotWorse',
'NumberOfDependents'
]]
y = data['SeriousDlqin2yrs']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=42)
#--- memory consumed by train dataframe ---
mem = X_train.memory_usage(index=True).sum()
print("Memory consumed by training set : {} MB" .format(mem/ 1024**2))
print('\n')
#--- memory consumed by test dataframe ---
mem = X_test.memory_usage(index=True).sum()
print("Memory consumed by test set : {} MB" .format(mem/ 1024**2))
Before training our models, a quick memory sanity check is in order. The split for the training set consumes 8MB, and the Test split 3.43MB. Given the order of complexity of the models to be used, no memory shortage realted problems should arise.
Now, regarding the model, we will train a industry scikit-learn standard parameters model. More information available at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
A pipeline object is defined in case further preprocessing techniques should arise in the future. The reported AUC score regarding this model is 0.8405 the confusion matrix is displayed below
pipeline = Pipeline([
('classifier', RandomForestClassifier(random_state = 42))
])
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)
np.mean(preds == y_test)
accuracy_score(preds, y_test)
print(classification_report(y_test, preds))
print('AUC Score of: ', roc_auc_score(y_test, pipeline.predict_proba(X_test)[:,1]))
We implement GridSearch cross validation as implemented on Sklearn.model_selection, the chosen grid of hyperparameters are defined as follows:
Even a small change upon the standard chosen hyperparameters help in the classification metrics as it can be seen on the ROC plot below.
from sklearn.model_selection import GridSearchCV
hyperparameters = { 'classifier__n_estimators': [20, 30, 50],
'classifier__max_depth': [2, 4],
'classifier__min_samples_leaf': [2, 4]
}
clf = GridSearchCV(pipeline, hyperparameters, cv = 3)
# Fit and tune model
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
np.mean(preds == y_test)
rfc_no_cv = plot_roc_curve(pipeline, X_test, y_test)
rfc_cv = plot_roc_curve(clf, X_test, y_test, ax=rfc_no_cv.ax_)
rfc_cv.figure_.suptitle("ROC curve comparison")
The now refined model displays better AUC.
print(classification_report(y_test, preds))
The model parameters are not the main objective of this article, nonetheless, the model displays a precision of 0.94 across the test set, with areas under the receiving operator curve of 0.85.
The relatively low f1 score relies entirely on the use-case. If this model is used entirely for accepting/rejecting loans, extreme care should be taken of not excluding a large percentage of the candidate population.
A common approach at explaining the results of a non-linear model, relies on plotting the Variable Importance for the whole population we will follow but also expand on the caveats of this approach.
pipeline.steps[0][1].feature_importances_
clf.best_estimator_.steps[0][1].feature_importances_
feature_imp = pd.Series(clf.best_estimator_.steps[0][1].feature_importances_,index=X.columns).sort_values(ascending=False)
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index, color="salmon")
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.legend()
plt.show()
Altough variable Importance shows us an important insight, regulators require granularity regarding client offerings/refusals. In order to comply to the most strict regulations we need to be able to provide reason codes for each client refused by the bank. Lets take the worst top 10 predicted probabilities on our test set.
print(pd.DataFrame(clf.predict_proba(X_test)).head(5))
type(X_test)
X_test_prob_df = pd.DataFrame(clf.predict_proba(X_test))
X_test_df = X_test.reset_index()
X_test_merge = X_test_df.merge(X_test_prob_df, left_index=True, right_index = True)
top10_worst_clients = X_test_merge.sort_values(by=1, ascending = False).head(10)
top10_worst_clients
import lime
import lime.lime_tabular
predict_fn_rf = lambda x: clf.predict_proba(x).astype(float)
predict_fn_rf = lambda x: clf.predict_proba(x).astype(float)
X = X_train.values
explainer = lime.lime_tabular.LimeTabularExplainer(X,feature_names = X_train.columns,class_names=[0,1],kernel_width=5)
y_test.loc[[69189]]
choosen_instance = X_test.loc[[69189]].values[0]
exp = explainer.explain_instance(choosen_instance, predict_fn_rf,num_features=10)
exp.show_in_notebook(show_all=False)
choosen_instance = X_test.loc[[59128]].values[0]
exp = explainer.explain_instance(choosen_instance, predict_fn_rf,num_features=10)
exp.show_in_notebook(show_all=False)
choosen_instance = X_test.loc[[62640]].values[0]
exp = explainer.explain_instance(choosen_instance, predict_fn_rf,num_features=10)
exp.show_in_notebook(show_all=False)
choosen_instance = X_test.loc[[42081]].values[0]
exp = explainer.explain_instance(choosen_instance, predict_fn_rf,num_features=10)
exp.show_in_notebook(show_all=False)
As it has been observed, variable importance on classifiers is just not enough for the level of granularity required to convince regulators and managers of the benefits of using ML techniques in Credit Risk.
This has been a small example that an accurate, fully explained model is possible, under current economic/Finance theory and by explaining client by client the reasoning of the coded model, the benefits and potential insights across all of the studied population are inmense.
The benefits of using non-linear approaches to find clusters of explainable phenomena across a banks portfolio are exactly what is needed now in times of distress and global economic recovery.
To recapitulate what we've done we:
The latest part seemed the most difficult. LIME is still on its early releases, it is still not fully scalable and the GUI elements from the explainer are still a bit unflexible but the power of explaining at the record level is indeed impressive.
There are some caveats at linearizing the probability space across the joint distributions, the explanations are only as good as the model linearization is. Further research on the algorithm is still needed to fully scale the results for the tens of thousands of applications any small bank receives per day.
The H2O solution for these problems was the developement of K-Lime. Displaying the results of the R^2 for each cluster linearization. Other (also non scalable) technique for MLI is the use of Shapley Values, a game theoretical approach in finding the contribution of each feature to the added predicted probability.
Further improvements to be done on this implementation could be regarding the preprocessing pipeline. Feature Engineering (As long as it doesn't break interpretability) and a more extensive CV search across all possible and recurrent best parameters for classification techniques.
[0] https://www.investopedia.com/terms/m/modernportfoliotheory.asp
[1] https://www.cnbc.com/2020/05/13/warren-buffett-cautions-against-carrying-a-credit-card-balance.html#:~:text=%E2%80%9CYou%20can't%20go%20through,pay%20down%20debt%2C%20the%20better.&text=And%20once%20you're%20debt,have%20to%20owe%20interest%20again
[2] Swan, Trevor W. (November 1956). "Economic growth and capital accumulation". Economic Record. 32 (2): 334–361
[3] https://www.investopedia.com/terms/d/debtrestructuring.asp#:~:text=Debt%20restructuring%20is%20a%20process,a%20default%20on%20sovereign%20debt.